ABSTRACT

The literature on optical and infrared microvariability in Active Galactic Nuclei (AGNs) reflects a diversity of statistical tests and strategies to detect tiny variations in the lightcurves of these sources. Comparison between the results obtained using different methodologies is difficult, and the pros and cons of each statistical method are often poorly understood or even ignored. Even worse, methodologies that have not been properly tested are becoming more and more common, and their biased results may mislead attempts to understand the origin of AGN microvariability. This paper intends to point future research on AGN microvariability towards powerful and well tested statistical methodologies, providing a reference for choosing the best strategy to obtain unbiased results. Monitoring lightcurves have been simulated for quasars and for reference and comparison stars. Changes in the quasar lightcurves include both Gaussian fluctuations and linear variations. Simulated lightcurves have been analyzed using χ² tests, F tests for variances, One-Way Analysis of Variance and C-statistics methodologies. Statistical Type I and Type II errors, which indicate the robustness and the power of the tests, have been obtained in each case. One-Way Analysis of Variance and χ² prove to be powerful and robust estimators for microvariations, while the C-statistics is not a reliable methodology and its use should be avoided.

Subject headings: methods: data analysis, statistical - techniques: photometric - galaxies: active. 



1. Introduction 

Observational techniques for optical monitoring of AGNs and photometric studies of variable stars have many similarities, but they differ in that variable stars often show periodic lightcurve fluctuations, while AGNs do not. Although the most variable AGNs (blazars) show microvariability or transient large amplitude variations within a few hours, the unpredictability of these changes in brightness and the difficulty of confirming them by other observers have been a perennial cause of incredulity and skepticism since the first report by Matthews & Sandage (1963) of a 15 min microvariability event of amplitude ΔV = 0.044 in 3C 48. In less variable objects, such as quasars, the amplitudes of the microvariations are usually lower and close to the limit of detection.

In order to increase confidence in the validity of variability reports, a number of statistical tests have been proposed to prove the reliability of the measurements. A methodology that has been widely used is the χ² test for variances, which compares a sample variance obtained from a suspected variable target with a theoretically calculated variance for a non-variable object, taking into account all the possible sources of error. For example, Pica & Smith (1983) use a χ² and the so-called Q-statistics, based on the difference between the brightest and the dimmest observations, to investigate long term variability in approximately 6000 photographic observations of 130 AGNs of different types monitored during 13 years. After the introduction of CCDs in astronomical observations at the end of the seventies, differential photometry became a very reliable technique for short time resolved lightcurve studies. In differential photometry, the flux of the target object is divided by that of a reference star in the same CCD frame. As the target and the reference star images are obtained simultaneously, at the same air mass and under identical instrumental and weather conditions, the flux ratio is considered to be very reliable. It is common practice to compare the differential lightcurves of the target and of at least one non-variable field star, denoted as comparison star.

A remarkable effort to update the statistical techniques used in photoelectric photometer and photographic plate measurements, taking advantage of the better and faster response of CCD detectors, was carried out by Howell et al., who compared the variances of the object and a comparison star using the F distribution. The F test compares two sample variances, one for the suspected variable source and the other for the non-variable comparison star. These authors also introduced the Γ factor to account for scale differences between the variances of the target and the comparison star due to photon noise. Nevertheless, χ²-based tests have endured, as they are well known and simple estimators for flux variations.

Jang & Miller (1997) and Romero et al. (1999) have proposed a new test to analyze microvariability based on the ratio C between the target and the comparison star standard deviations, rather than variances. This C-statistics resembles the Q-statistics of Pica & Smith (1983), but is less sensitive to the spurious effect of a single discrepant point. In the last decade C-statistics has become very popular, and several researchers have adopted this statistical methodology to study quasars (e.g. Stalin et al. 2004a; Gupta et al. 2008) and blazars (e.g. Xie et al. 2004; Andruchow et al. 2005).

A different methodology to analyze differential lightcurves has been proposed by de Diego et al. (1998). These authors use the One-Way Analysis of Variance (ANOVA) test for studies of quasar microvariability. Using this technique, de Diego et al. (1998) were the first to claim that microvariability events were as frequent in radio quiet as in radio loud quasars (excluding blazars). ANOVA has also been applied in other studies of AGN variability (Ramírez et al. 2004; Villforth et al. 2009; Ramírez et al. 2009). However, because the novel results reported by de Diego et al. (1998) were unexpected at the time, and because they were difficult to compare with previous studies, the ANOVA test has still not gained full acceptance (Romero et al. 1999; Carini et al. 2007). In this paper, independent runs of data are simulated using Monte Carlo techniques and analyzed using One-Way ANOVA and other statistical techniques.

Despite their generic name, which can be misleading, ANOVA tests are designed to detect differences between several sample means, rather than between sample variances. Thus, such tests can be considered a generalization of the Student t-tests for differences between two sample means. It is worth noticing that tests for means can distinguish smaller differences than tests for variances (see §4), and thus ANOVA is expected to improve the detection of microvariability events. ANOVA tests are used, for instance, in Experimental Design statistical methodologies (e.g. Box et al. 2005), which deliberately impose one or more conditions on different groups of data in the interest of observing the response. These methodologies radically differ from those common in Astronomy, which involve collecting and analyzing data on the run and where external conditions cannot be manipulated. Therefore, only after applying the same objective method to this and other statistical tests will we be able to effectively establish the reliability and advantages of the ANOVA test.

This paper presents a comparison of the outcome of different analysis strategies for the detection of low amplitude microvariations when reliable astronomical differential photometric data are available. By reliable data we mean data characterized by random errors and not affected by systematic errors. Dealing properly with systematic observational errors would require, first, detecting that the data are indeed affected by these errors; second, understanding their cause; and third, correcting for these systematics by changing and fine tuning the observational set-up, to the extent that this is possible. In this paper, systematic effects will be entirely disregarded.

To analyze the data, several procedures based on χ² tests, F tests for variances, One-Way ANOVA and C-statistics will be considered in turn. Even though these are common statistical methodologies, there exist many different implementation strategies for these tests and, in principle, new alternative tests could be carried out. Strictly speaking, some of the derived inferences or comparisons established between the various tests may be valid only for the particular cases considered here. Furthermore, we cannot rule out that observational circumstances or set-up implementations different from those envisaged in this paper might lead to altogether different results.

This paper is organized in the following way: the simulation procedure is described in §2, results are shown in §3, a discussion and comparisons between the tests are presented in §4, and the conclusions are summarized in §5. In Appendix A the interested reader can also find the mathematical description of each test, along with some comments on their use and validity. An interesting implementation of the One-Way ANOVA to improve the detection of microvariability in the AGN lightcurves is discussed in Appendix B.

Table 1: Parameters used in the simulations.

                  Electron counts       Total
Description       Signal      Noise     S/N
Detector          ...         103       ...
Sky               23,287      152       ...
V = 17            32,292      179       126
V = 15            203,766     451       418

2. Lightcurve simulations 

In order to accurately analyze the power and robustness¹ of the different statistical methodologies, simulations were performed for the lightcurves of the quasar, a reference star and two comparison stars. For convenience, the Instrument Simulator² software of the Mexican Observatorio Astronomico Nacional (OAN) was used to obtain basic data for the simulations. The input arguments were: the 1.5 m telescope, the SITe1 detector, 2×2 binning, V filter, 1.5" seeing, 3" aperture, 60 s exposure time, and magnitudes V = 17 and 15 for the quasar and the reference star, respectively. For simplicity, the comparison stars have been chosen to have the same magnitude as the quasar. The output includes, among others, the following parameters: the signal to noise ratio, and the object, sky and detector (total read-out) noises. From these parameters, the object and the sky electron counts were obtained assuming a Poisson distribution (i.e. photon shot noise). The parameters used in the simulations are shown in Table 1. Column (1) describes the source associated with the parameters; columns (2) and (3) indicate the electron counts for the signal and noise, respectively; column (4) shows the signal to noise ratio (SNR).
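As a cross-check of Table 1, the tabulated total S/N values are consistent with the standard CCD signal-to-noise expression, in which the object shot noise, the sky shot noise and the detector noise add in quadrature. The sketch below assumes that expression (the internal formula of the Instrument Simulator is not given here), and the function name `total_snr` is purely illustrative.

```python
import math

# Electron counts from Table 1.
DETECTOR_NOISE = 103.0      # total read-out noise (e-)
SKY_SIGNAL = 23_287.0       # sky counts in the aperture (e-)
SIGNALS = {"V = 17": 32_292.0, "V = 15": 203_766.0}

def total_snr(signal, sky=SKY_SIGNAL, detector_noise=DETECTOR_NOISE):
    """S/N assuming Poisson object and sky noise plus detector noise in quadrature."""
    variance = signal + sky + detector_noise ** 2
    return signal / math.sqrt(variance)

for label, counts in SIGNALS.items():
    shot_noise = math.sqrt(counts)          # Poisson noise of the object alone
    print(f"{label}: noise = {shot_noise:.1f} e-, S/N = {total_snr(counts):.1f}")
```

Under this assumption the computed S/N values round to the 126 and 418 listed in the last column of Table 1, and the noise column is close to the square root of the corresponding signal.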

¹ The power of a statistical test is the probability that the test will reject a false null hypothesis. A robust statistical technique is one that performs well even if its assumptions are somewhat violated by the inherent properties of the sampled population.

² The Instrument Simulator of the OAN has been developed by Alan Watson and can be accessed through the OAN web page http://132.248.4.258/~resast/simulador

Fig. 1. Simulated raw curves (a) and differential lightcurves (b) for a V = 17 quasar (★) and comparison star (+, with an offset of 0.1 mag). At the middle of the simulation the objects are crossing the zenith, where the atmospheric absorption has been set at 20%. A reference star with V = 15 has been used to calibrate the observations. The amplitude of the variability in this case was 0.0267, and the peak has a FWHM of 60 min.

For a 1.5 m telescope and a fairly good SNR, larger than 100, 1 min exposures are reasonable. Longer exposures may saturate bright objects and stars which might be used as reference in differential photometry. Thus, every simulation comprises 150 × 1 min exposures during 5 h of monitoring, with a 1 min lag between exposures to account for CCD read-out. For more realistic simulations, atmospheric attenuation in the V band (A_V = 0.2 mag/air mass) has been taken into account. The object is supposed to pass through the zenith in the middle of the run (i.e. 2.5 h after the monitoring began). Photon noise was generated for the object, the reference and comparison stars, as well as for a constant brightness sky. Finally, white Gaussian noise was also generated to account for the CCD read-out. The photometric accuracy of the differential lightcurves of the quasar and the comparison star is about 0.01 mag.
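The noise model just described can be sketched in a few lines. This is a minimal illustration, not the simulation code used for the paper: it assumes that the Table 1 counts apply at the zenith, uses a Gaussian approximation to the Poisson noise, and adopts a hypothetical site latitude of 31° together with a plane-parallel air mass; all function names are illustrative.

```python
import math
import random
import statistics

# Table 1 electron counts per 60 s exposure (taken here as the zenith values).
SKY, READ_NOISE = 23_287.0, 103.0
COMPARISON, REFERENCE = 32_292.0, 203_766.0    # V = 17 and V = 15 stars
K_V = 0.2                                      # extinction in mag per air mass
LATITUDE = math.radians(31.0)                  # assumed site latitude (hypothetical)

def airmass(t_min, t_zenith=150.0):
    """Plane-parallel air mass of an object that crosses the zenith at t_zenith."""
    hour_angle = math.radians(15.0 * (t_min - t_zenith) / 60.0)
    cos_z = math.sin(LATITUDE) ** 2 + math.cos(LATITUDE) ** 2 * math.cos(hour_angle)
    return 1.0 / cos_z

def measured_counts(source, t_min, rng):
    """Sky-subtracted counts: attenuated source plus shot and read-out noise."""
    attenuated = source * 10 ** (-0.4 * K_V * (airmass(t_min) - 1.0))
    # Gaussian approximation to the Poisson noise of source + sky (counts > 10^4).
    shot = rng.gauss(0.0, math.sqrt(attenuated + SKY))
    readout = rng.gauss(0.0, READ_NOISE)
    return attenuated + shot + readout

def differential_lightcurve(rng, n_points=150, cadence=2.0):
    """Differential magnitudes of a non-variable V = 17 star against the reference."""
    mags = []
    for i in range(n_points):
        t = i * cadence
        target = measured_counts(COMPARISON, t, rng)
        reference = measured_counts(REFERENCE, t, rng)
        mags.append(-2.5 * math.log10(target / reference))
    return mags

curve = differential_lightcurve(random.Random(1))
print(f"scatter of the differential lightcurve: {statistics.stdev(curve):.4f} mag")
```

Because the target and the reference are attenuated by the same factor, the extinction cancels in the flux ratio and only the noise remains, at a level of roughly 0.01 mag under these assumptions.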

Table 2: Results of the simulations.

                                Gaussian peak variation           Linear variation
Test                  α        Type I  Detections  Type II    Type I  Detections  Type II
χ² test               0.001      10      1614       4386         2      2278       3722
Idem (std)            0.001     125      1748       4252       129      2385       3615
One-Way ANOVA         0.001       3      2187       3813         4      3043       2957
Idem (30 min lags)    0.001       2      1086       4914        11      1668       4332
F test (α = 0.1%)     0.001       9       837       5163         5      1302       4698
F test (α = 1%)       0.01       63      1443       4557        61      2050       3950
C-statistics          0.01        0         0       6000         0         0       6000

Two runs of simulations were performed. One of the runs considered a Gaussian shaped variation in the quasar flux. The duration of the variations is described by their FWHM, which is allowed 60 discrete values in the range between 1 and 60 min (i.e. in steps of 1 min). This range for the duration of the variations includes spikes (see Sagar et al. 1996; de Diego et al. 1998; Gopal-Krishna et al. 2000; Stalin et al. 2004a). The peak of the variation was allowed to be centered between 120 and 180 min from the beginning of the monitoring. This range for the peak center ensures that the whole variation is contained in the data set. Finally, for each of the 60 values set for the duration of the variation, 100 simulations were performed allowing random amplitudes up to 3% of the quasar flux (i.e. ~0.03 mag). Therefore, the total number of simulations was 6000 (60 × 100), of 150 photometric points each. Fig. 1a shows the final simulated raw curve for the quasar and comparison star for a Gaussian variation, while Fig. 1b shows the same curves after calibration by the observations of the reference star. The standard deviation of the differential curve of the comparison stars shows that the photometry is accurate up to 0.009 mag.

For the other run, the lightcurves of the objects present a constant-rate (linear) flux variation, as shown in Fig. 2. The amplitudes of these variations, measured as the difference between the first and the last data point, were also randomly chosen up to 3% of the quasar flux. As in the Gaussian peak case, a total of 6000 simulations of 150 data points each were performed. In both cases, Gaussian peak and linear variation, all the fluxes and estimated errors were converted into magnitudes for the analysis.
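The two kinds of injected variation can be written down compactly. The sketch below is only illustrative: the text states that amplitudes and peak centers are random within the quoted ranges, and the uniform distributions and positive-only amplitudes used here are assumptions, as are all function names.

```python
import math
import random

# Start times of the 150 exposures in minutes (1 min exposure + 1 min read-out).
TIMES = [2.0 * i for i in range(150)]

def gaussian_peak(t, amplitude, fwhm, center):
    """Relative flux for a Gaussian-shaped brightening of given FWHM (min)."""
    sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return 1.0 + amplitude * math.exp(-0.5 * ((t - center) / sigma) ** 2)

def linear_trend(t, amplitude, t_first=TIMES[0], t_last=TIMES[-1]):
    """Relative flux for a steady change of `amplitude` between first and last point."""
    return 1.0 + amplitude * (t - t_first) / (t_last - t_first)

def draw_gaussian_runs(rng, n_per_fwhm=100, max_amplitude=0.03):
    """Yield (fwhm, amplitude, center) for the 60 x 100 Gaussian peak simulations."""
    for fwhm in range(1, 61):
        for _ in range(n_per_fwhm):
            # Uniform draws are an assumption; the text only says "random".
            yield fwhm, rng.uniform(0.0, max_amplitude), rng.uniform(120.0, 180.0)

runs = list(draw_gaussian_runs(random.Random(0)))
fwhm, amplitude, center = runs[-1]
flux = [gaussian_peak(t, amplitude, fwhm, center) for t in TIMES]
print(len(runs), "parameter sets; peak relative flux of the last one:", round(max(flux), 4))
```

Multiplying the noiseless quasar counts by one of these profiles before adding the noise gives a variable lightcurve of the type shown in Figs. 1 and 2.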

3. Results 

The results of the statistical analysis of the simulations are summarized in Table 2. Column (1) identifies the test (see below); column (2) indicates the significance level α; columns (3), (4) and (5) show the number of Type I errors found in the analysis of one of the comparison stars, the number of detections of microvariations in the quasar lightcurves, and the number of Type II errors for the quasars, respectively, for the 6000 simulations of the Gaussian peak variation; columns (6), (7) and (8) repeat the same numbers but for linear variations.

The first item in Table 2 is the χ² test described in §A.2. To perform this test, the true error distributions introduced in the simulations (photon and white Gaussian noises) have been considered to estimate the error of each individual data point. In the second test, a similar χ² analysis has been performed, but considering the standard deviation of the comparison star instead of the individual errors of each measurement (note that in this case Type I errors have been calculated from the other comparison star). The third test is One-Way ANOVA, performed by grouping the data in sets of 5 individual observations (see description in §A.3). The fourth test is also One-Way ANOVA but with 30 min lags between group sampling (and thus it is the only test that considers only a fraction of 1/3 of the simulated observations). The results of two F tests for variances (see §A.1 for a description) are reported in the fifth and sixth items, the former at the significance level α = 0.001 (or 0.1%) to compare with the previous tests, and the latter at α = 0.01 (or 1%) to compare with the C-statistics. Finally, the seventh test is C-statistics as described in §A.4.
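For orientation, the χ² test of the first item can be sketched as follows. The sketch assumes the usual form of the statistic, the sum of squared deviations from the weighted mean divided by the individual variances, with N − 1 degrees of freedom; the definition actually used is the one given in §A.2. Since the Python standard library has no χ² quantile function, the critical value is obtained from the Wilson-Hilferty approximation, which is a choice made here and not part of the paper.

```python
import math
import random
from statistics import NormalDist

def chi2_statistic(mags, errors):
    """Chi-square of the data about their inverse-variance weighted mean."""
    weights = [1.0 / e ** 2 for e in errors]
    mean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    return sum(w * (m - mean) ** 2 for w, m in zip(weights, mags))

def chi2_critical(dof, alpha):
    """Upper-tail critical value from the Wilson-Hilferty approximation."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + z * math.sqrt(a)) ** 3

rng = random.Random(42)
sigma = 0.009                                   # per-point error in mag
steady = [rng.gauss(0.0, sigma) for _ in range(150)]
stat = chi2_statistic(steady, [sigma] * 150)
crit = chi2_critical(149, 0.001)
print(f"chi2 = {stat:.1f}, critical value = {crit:.1f}, variable: {stat > crit}")
```

For 149 degrees of freedom and α = 0.001 the approximation gives a critical value close to 208, and a non-variable lightcurve yields a statistic scattered around 149.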



Fig. 2. Simulated raw curves (a) and differential lightcurves (b) for a V = 17 quasar (★) and comparison star (+, with an offset of 0.1 mag). At the middle of the simulation the objects are crossing the zenith, where the atmospheric absorption has been set at 20%. A reference star with V = 15 has been used to calibrate the observations. The amplitude of the variability in this case was 0.0113, and the flux increases steadily during the monitoring.

The significance level α is the probability, set a priori by the researcher, that a test yields, only by chance, a result at least as extreme as the one observed. Note that the ANOVA and χ² tests have been performed for a significance level of 0.1%, which corresponds to the usual detection limit of 3σ. On the other hand, the significance level of the C-statistics is set at 1%, or a detection limit of 2.576σ, which is the level commonly defined for this test (e.g. Jang & Miller 1997; Romero et al. 1999). For the F test, both significance level values are considered: α = 0.1% to compare with the χ² and ANOVA tests, and α = 1% to compare with the C-statistics, as the F test is considered as an alternative to this methodology (see §A.4).
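The correspondence between a significance level and a number of Gaussian standard deviations can be checked with the standard library. The helper below is illustrative; the two-sided convention is an assumption that reproduces the 2.576σ quoted for α = 1%, while α = 0.1% comes out at slightly more than 3σ under either convention.

```python
from statistics import NormalDist

def sigma_equivalent(alpha, two_sided=True):
    """Number of Gaussian standard deviations corresponding to significance alpha."""
    tail = alpha / 2.0 if two_sided else alpha
    return NormalDist().inv_cdf(1.0 - tail)

for alpha in (0.001, 0.01):
    print(f"alpha = {alpha}: "
          f"{sigma_equivalent(alpha):.3f} sigma (two-sided), "
          f"{sigma_equivalent(alpha, two_sided=False):.3f} sigma (one-sided)")
```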

Type I errors are due to the rejection of a true null hypothesis (i.e. rejecting the non-variability hypothesis for a non-variable object). For an unbiased test, the expected number of Type I errors depends only on the number of data sets examined and the significance level of the test. In this case, the number of Type I errors has been obtained by testing the simulated differential curves of a (non-variable) comparison star. For a significance level of, say, 0.1% and 6000 simulations, we should expect around 6 spurious detections. Considering a Binomial distribution for the number of Type I errors, we expect that its actual number for a given test will be between 0 and 13, and in most cases in the 6 ± 3 interval. If the number of Type I errors for the comparison star is significantly different from the expected frequency, it is evidence that the actual significance of the test differs from its nominal value and the test is not reliable.

The number of detections in Table 2 indicates how many tests have succeeded in detecting variability in the simulated differential lightcurves of the quasar, and it is also a measure of the power of the test, which generally varies as a function of the data set characteristics. On the other hand, Type II errors are due to the acceptance of a false null hypothesis (i.e. accepting the non-variability hypothesis for a variable object). Note that in all the simulations the quasar varied. Therefore, as the number of Type I errors is low with respect to the number of detections, the number of Type II errors is (approximately) 6000 minus the number of detections.³

From Table 2 we see that the χ² test performed taking into account the actual error distribution, One-Way ANOVA, One-Way ANOVA with 30 min lags, and the F test at α = 0.1% all show a number of Type I errors in accordance with expectations. The F test at α = 1% and the C-statistics have a less stringent significance level, and accordingly the expected number of Type I errors would be 60. The results for the F test at α = 1% agree with this expectation; nevertheless, C-statistics consistently produced neither Type I errors nor detections, in accordance with the arguments presented in §A.4. In fact, in these simulations, C-statistics is always a factor > 2 below its critical value (2.576), which determines the boundary between rejecting and accepting the null hypothesis.
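The behaviour of the C-statistics can be illustrated with a single simulated case. The sketch below takes C as the ratio between the standard deviations of the target and comparison star differential lightcurves, as described in the Introduction; the noise level, the amplitude and the FWHM are illustrative values within the ranges of the simulations, and the noise is drawn directly in magnitudes for brevity.

```python
import math
import random
import statistics

C_CRITICAL = 2.576   # critical value quoted in the text for alpha = 1%

def c_statistic(target_mags, comparison_mags):
    """Ratio of the standard deviations of the two differential lightcurves."""
    return statistics.stdev(target_mags) / statistics.stdev(comparison_mags)

rng = random.Random(7)
times = [2.0 * i for i in range(150)]
noise = 0.009                                   # photometric scatter (mag)
amplitude, fwhm, center = 0.02, 30.0, 150.0     # illustrative Gaussian peak
sigma_t = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

comparison = [rng.gauss(0.0, noise) for _ in times]
quasar = [rng.gauss(0.0, noise)
          - amplitude * math.exp(-0.5 * ((t - center) / sigma_t) ** 2)
          for t in times]                       # a brightening lowers the magnitude

c = c_statistic(quasar, comparison)
print(f"C = {c:.2f}, detection: {c > C_CRITICAL}")
```

A peak of this size raises the standard deviation of the quasar lightcurve only modestly above that of the comparison star, so C stays well below 2.576 and no detection is claimed.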

The χ² test performed considering the standard deviation of the comparison star, instead of the actual errors for each observation in the quasar lightcurve, shows a number of Type I errors much larger than expected (Table 2). This is a consequence of employing a wrong methodology. As explained in §A.4, not considering the number of degrees of freedom in the estimation of the standard deviation of the lightcurve of the comparison star produces a biased statistics. The increase in the number of detections with respect to the previous χ² test is not significant. As the number of Type I errors is still much smaller than the number of Type II errors, it is possible to estimate the number of spurious detections of variability in the quasar lightcurves (i.e. detections not related to the actual variation of the lightcurves). In this case the number of spurious detections will be approximately 125 × 4252/6000 = 89, while a similar calculation produces 7 spurious detections for the previous χ² test. Thus, even if the biased statistics affected only the number of Type I errors and not Type II errors (which will probably be affected as well), we would expect a difference of about 80 detections between both tests, accounting for most of its actual value of 134.

³ A more accurate calculation would take into account the fraction of Type I errors included in the number of detections. If Type I errors were frequent, mixtures of honest detections and Type I errors should also be present.
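The estimate of spurious detections can be reproduced directly from the Gaussian peak columns of Table 2. In the sketch below the Type I rate measured on the comparison star is applied to the lightcurves counted as Type II errors, which is one reading of the expression used in the text; the function name is illustrative.

```python
N_SIM = 6000

def spurious_detections(type_i, type_ii, n_sim=N_SIM):
    """Type I rate (type_i / n_sim) applied to the type_ii undetected lightcurves."""
    return type_i * type_ii / n_sim

# Gaussian peak run, Table 2.
chi2_true_errors = spurious_detections(10, 4386)     # chi2 with true errors
chi2_std_errors = spurious_detections(125, 4252)     # chi2 with comparison star std
print(round(chi2_true_errors), round(chi2_std_errors))
print("expected difference:", round(chi2_std_errors - chi2_true_errors))
print("observed difference:", 1748 - 1614)
```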

On the contrary, for the F tests the number of Type I errors agrees with the expected number, as commented above. However, they show a rather small number of detections in comparison with the other tests (see Table 2). Thus, the detections for the F test at α = 0.1% are well below the results for the One-Way ANOVA with 30 min lags, even though the number of data points for the latter test is a factor 3 smaller than for the F test. In the case of the F test at α = 1%, the significance level is loose enough that a larger number of detections would be expected with respect, for example, to the χ² test at a significance level of 0.1%; yet the opposite is true. There are two possible explanations for these results: one is the non-robustness of the F test for non-Gaussian distributed data (Lehmann 1986, §5.4), and the other is that the test has intrinsically less power than χ² and ANOVA. Indeed, some of the comparison star differential lightcurves are not well fitted by a Gaussian profile, as shown in Fig. 3, perhaps as a consequence of the underlying Poisson distribution associated with flux measurements. To investigate this possibility, a set of Kolmogorov-Smirnov goodness-of-fit tests of Gaussianity (e.g. Wall & Jenkins 2003, §5.3.2) was performed on the distribution of the simulated Gaussian peak data for each of the 6000 lightcurves of the comparison star. The significance level of the test was fixed at 5%; therefore around 300 Type I errors were expected if the data were fairly Gaussian distributed. The actual number of rejections found by the Kolmogorov-Smirnov test was 296, which rules out the non-Gaussian distribution explanation. Then, the differences in detecting lightcurve variations with respect to the results of the χ² and ANOVA tests should be imputed to a relatively diminished power of the F test.

Fig. 3. Histogram of the differential lightcurve of a comparison star. In some cases, as in this simulation, which shows a strong skewness, the distribution of the differential lightcurve departs from a Gaussian curve.

The number of detections of the One-Way ANOVA tests is significantly larger than that of the χ² tests. The 150 observations computed for each run of simulations are divided into 30 groups of 5 observations each, and the time interval in each group spans 10 min (remember that exposures and read-outs last 1 min each). Even if the number of groups is reduced to 10 (i.e. sampling an object every 30 min, as in the case of One-Way ANOVA with 30 min lags), ANOVA maintains its robustness and, to a large extent, the power to detect microvariations.
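The grouping into 30 sets of 5 observations can be made concrete with a short sketch of the One-Way ANOVA F statistic. It uses the textbook between-group and within-group mean squares (the paper's own description is in §A.3), draws the noise directly in magnitudes, and takes the critical value of 2.3 from the example quoted later in the text for the full ANOVA at α = 0.001 rather than computing it.

```python
import random
import statistics

def anova_f(values, group_size=5):
    """One-way ANOVA F statistic for consecutive groups of `group_size` points."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    k, n_total = len(groups), len(values)
    grand_mean = statistics.fmean(values)
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - statistics.fmean(g)) ** 2 for g in groups for x in g)
    dof_between, dof_within = k - 1, n_total - k
    f_stat = (ss_between / dof_between) / (ss_within / dof_within)
    return f_stat, dof_between, dof_within

F_CRITICAL = 2.3   # value quoted in the text for 30 groups of 5 at alpha = 0.001

rng = random.Random(3)
noise = 0.009
steady = [rng.gauss(0.0, noise) for _ in range(150)]
# Linear drift of 0.03 mag over the run on top of the same noise level.
drifting = [rng.gauss(0.0, noise) + 0.03 * i / 149 for i in range(150)]

for label, series in (("steady", steady), ("drifting", drifting)):
    f_stat, nu1, nu2 = anova_f(series)
    print(f"{label}: F({nu1}, {nu2}) = {f_stat:.2f}, detection: {f_stat > F_CRITICAL}")
```

For a non-variable lightcurve F scatters around 1, whereas a slow drift inflates the variance between group means while leaving the scatter within groups almost unchanged.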

4. Discussion 

To perform the ANOVA test it is necessary to bin the data. Although the results for the χ² test using the standard deviation of the comparison star show the risks of introducing what might be considered a priori reasonable changes in the strict procedure of a test, it is still questionable whether the χ² test would perform as well as ANOVA if the data were binned in the same way. However, just as there is a loss of information in going from a list of observations to a histogram, the binning procedure applied to the χ² test will produce an immediate loss in the power of the test. The theoretical statistical grounds for rejecting the χ² binning methodology may be traced back at least to the lossy compression methods considered in Shannon's information theory. Binning data has the effect of reducing the signal besides the noise, and it is mathematically impossible to get any additional information from the binned data. Thus, analyzing the binned data instead of the raw signal with the same statistical procedure will result in an unavoidable loss of sensitivity of the test.

Things get even worse when the signal is low, close to the 3σ limit, as is frequent in microvariability studies. Besides blurring the signal, binning data in groups of n observations in a χ² procedure also has the effect of dividing by n the degrees of freedom of the statistical analysis. Both effects combined are disastrous for the detection of tiny variations. For the simulations presented in this paper, the raw data χ² procedure was able to detect 1614 variations out of a total of 6000 cases, while a side calculation based on the binned procedure for n = 5 yielded only 284 detections. In comparison, ANOVA produced 2187 detections. Although ANOVA also groups the data, producing a loss of signal, it redistributes the degrees of freedom between groups and error estimates within groups, rather than canceling them. Thus, if N is the number of observations and k the number of groups (k = N/n), the degrees of freedom are ν1 = k − 1 for the groups and ν2 = N − k for the errors, and therefore ν1 + ν2 = N − 1, which corresponds to the degrees of freedom of the original dataset. Besides, as stated above, ANOVA tests (group) means, while χ² tests variances, and it is well known that tests for means are more powerful than tests for variances, among other things, because the actual values of means are more tightly constrained. This is a result of the squaring of each term, which effectively weights outliers and large errors more heavily than small ones. For example, a few simulations generating samples of size 100 drawn from a normally distributed population with mean μ = 0 and standard deviation σ = 1 show that the ratio between their respective 95% confidence intervals is C.I.(σ²)/C.I.(μ) ≈ 1.5.
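The confidence interval comparison at the end of the paragraph above can be reproduced with a small Monte Carlo experiment. The sketch assumes that the quoted ratio refers to the widths of the standard 95% intervals (Student t for the mean, χ² for the variance); the quantiles are obtained from standard-library approximations, which is a convenience chosen here and not something stated in the paper.

```python
import math
import random
import statistics
from statistics import NormalDist

N, CONF = 100, 0.95
DOF = N - 1
Z = NormalDist().inv_cdf(0.5 + CONF / 2.0)

# Approximate quantiles (standard library only): first-order correction for
# Student's t and the Wilson-Hilferty approximation for chi-square.
T_CRIT = Z + (Z ** 3 + Z) / (4.0 * DOF)
A = 2.0 / (9.0 * DOF)
CHI2_LO = DOF * (1.0 - A - Z * math.sqrt(A)) ** 3
CHI2_HI = DOF * (1.0 - A + Z * math.sqrt(A)) ** 3

def ci_width_ratio(sample):
    """Width of the 95% C.I. of the variance over that of the mean."""
    s2 = statistics.variance(sample)
    width_mean = 2.0 * T_CRIT * math.sqrt(s2 / len(sample))
    width_var = DOF * s2 * (1.0 / CHI2_LO - 1.0 / CHI2_HI)
    return width_var / width_mean

rng = random.Random(11)
ratios = [ci_width_ratio([rng.gauss(0.0, 1.0) for _ in range(N)]) for _ in range(200)]
print(f"mean C.I.(sigma^2)/C.I.(mu) = {statistics.fmean(ratios):.2f}")
```

With these assumptions the interval for the variance is about one and a half times as wide as the interval for the mean, in line with the ratio quoted above.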

In the case of ANOVA and ANOVA with 30 min lags, the time interval within each group has been chosen such that it does not exceed 10 min. This limit is imposed by previous experience that optical microvariations on timescales of less than 20 min in quasars are rare, hard to detect, or both. On the other hand, when the monitoring is performed in several optical bands that will be compared later, it should fulfil simultaneity criteria of variability between the involved bands. Thus, Villata et al. (2004) have considered time intervals lasting around 10 min for photometric sequences between bands V and I in blazars, while Papadakis et al. (2004) and Hu et al. (2006) have calculated lags of around 20 and 11 min between bands B and I for AGNs and blazars, respectively. The difficulty of detecting even large microvariability events lasting less than 10 min has also been shown in the results of these simulations, no matter which statistical methodology has been used (see for example Fig. 4). Thus, 10 min is a safe time interval to bin data sharing similar flux characteristics, statistically indistinguishable from the noise, and will be appropriate for many studies of quasar microvariability. However, for the ANOVA continuous monitoring strategy, it is still possible to improve the power of the test by trying out different bin sizes after the observations have already been made (see Appendix B).

Fig. 4. Detection frequencies against temporal FWHM of the variation for the Gaussian peaks: χ² test (•), One-Way ANOVA (△), and One-Way ANOVA with 30 min lags (□).

The choice of 1 min exposures has been justified in §2. Thus, a group of around 5 such exposures, lasting less than 10 min including CCD read-out, is a reasonable methodological choice for the observational strategy of ANOVA with 30 min lags, for which the bin size of the groups is set before the observations and cannot be changed afterwards. For a larger telescope the exposures might be shorter, but the total number of observations in each group is still limited by the read-out dead time. Besides, the gain in the power of the test would be relatively small for a number of exposures larger than 5.

In the rest of this section only the χ² test, the One-Way ANOVA with and without time lags between group observations, and the C-statistics will be discussed. The results of the simulations presented in §3 for the χ² test using the standard deviation of the comparison star, and for the F test for sampled variances, are descriptive enough to show the possible risks of relying on these procedures.

4.1. Tests comparison 

It is expected that the probability of detecting a change in the brightness of a source depends on both the amplitude of the variation and its duration. The range of amplitudes is the same for both the Gaussian peak and the linear simulations, but the duration of the Gaussian peak variations is restricted to a FWHM of 60 min or less, while the duration of the linear variations spreads over the 5 h of monitoring. Therefore, it is not surprising that linear variations are detected more easily than the shorter Gaussian peaks considered in the simulations (see Table 2). The effect of the duration of the Gaussian peak variations on the number of detections is shown in Fig. 4 for the χ² test, One-Way ANOVA, and One-Way ANOVA with 30 min lags.

Fig. 5a compares the distributions of the χ² and the ANOVA F statistics for the non-variable comparison star. This plot yields a circular cloud of points (allowing for a certain amount of distortion due to the use of different scales), as expected for non-biased statistics. The points are centered at coordinates (1, 149), as the expected values for ANOVA F and for χ² with 149 degrees of freedom, when the null hypothesis is true, are 1 and 149, respectively. Fig. 5b shows a similar plot for the Gaussian peak variation quasar statistics. The clear correlation between both statistics indicates that they are measuring the same variable phenomenon. Note that the maximum range of the χ² statistics is about 2 times its critical value, while the ANOVA F statistics spreads out to approximately 5 or 6 times its respective critical value. This is a consequence of the different powers of the tests under the conditions of these simulations. The same arguments apply in the case of the linear variations. In this case, the difference in power between the ANOVA and the χ² tests can be illustrated with the results for the quasar lightcurve shown in Fig. 2, where the ANOVA F statistics is larger than the critical value set to detect variations (F = 2.8 > 2.3), while χ² is below its critical value (χ² = 164 < 208).

Fig. 6 shows the distributions of (a) $\chi^2$ statistics, (b) ANOVA F,
(c) ANOVA F with 30 min lags between group observations and (d)
C-statistics, against absolute amplitude of the variations for temporal
Gaussian peak variations, along with the critical values for each test
(indicated by thick horizontal lines). It is clear that for (a), (b)
and (c) the number of detections (points above the lines indicating
critical values) increases with the variability amplitude. Remember
that ANOVA, $\chi^2$ and C-statistics were calculated using the same
data set, while for One-Way ANOVA with 30 min lags the data set is
resampled to one third of the simulated observations. From Fig. 6, it
is obvious that C-statistics is about a factor 2 below any detection
even though the nominal significance level for this test
($\alpha = 1\%$) is less strict than for the ANOVA and $\chi^2$ tests
($\alpha = 0.1\%$).

The results for the linear variations shown in Fig. 7 are similar to
those for the Gaussian peak variations, but the statistics are less
scattered because there is no effect of the length of the variation.
For a significance level of $\alpha = 0.1\%$ and the conditions of the
simulations, all the linear variations with amplitudes > 0.027 mag are
detected by the $\chi^2$ test. On the other hand, One-Way ANOVA detects
all the variations with amplitudes > 0.022 mag. In comparison, One-Way
ANOVA with 30 min lags detects around 90% of the variations for
amplitudes near 0.03 mag. As in the Gaussian peak case, C-statistics is
again a factor 2 below any detection at the significance level of
$\alpha = 1\%$.

Percentages of detections per amplitude range are shown for the
Gaussian peak variations in Fig. 8 and for the linear variations in
Fig. 9. Percentages for the Type II errors per amplitude range can be
easily derived from these figures by subtracting the percentages of
detections from one hundred. In the case of the linear variations
considered in these simulations, the dependence of the number of
detections on the amplitude of the variations is straightforward. But
in the case of the Gaussian peak, the double dependence on the
amplitude and the temporal length of the variation makes the
relationship less evident. After smoothing, this double dependence can
be shown as a contour plot of the probability of detection as a
function of the amplitude and duration of the microvariability event,
as shown in Fig. 10. Note that the probability of detection increases
with both the amplitude and the duration of the variation.

The bulk of all these results attests that both the $\chi^2$ test and
One-Way ANOVA are robust methodologies to study variability in the
lightcurves of quasars. However, the $\chi^2$ test relies on an
accurate theoretical estimation of the data error for each single data
point. As commented in § A.2, this is not usually the case, and a
number of ad hoc correction factors have been used to compensate for
the notorious lack of agreement between the IRAF estimated photometric
errors and the actual dispersion of the data (e.g. Gopal-Krishna et al.
2003; Stalin et al. 2004a; Bachev et al. 2005). But the large increase
in the number of Type I errors obtained using a sample standard
deviation drawn from the comparison star instead of the actual error
(Table 2), as well as the arguments given above about the loss of power
of the test when the data are binned, warn against any simple approach
to dodge this problem. In contrast, the error estimation issue is
offset by the internal error estimation of the ANOVA tests. It is
reasonable to assume for the $\chi^2$ test that an accurate estimate of
the actual errors for each data point might be achieved by measuring
the data dispersion for a large number of foreground stars. However, we
have already discussed above that dispersion measurements such as
variances and standard deviations are poorly constrained. Let us
investigate the possible gain in accuracy by using a set of comparison
stars.

Fig. 5. - Comparison between the $\chi^2$ and the One-Way ANOVA F
statistics for (a) reference star and (b) quasar with Gaussian
variations. Thick lines show the critical values for each statistics.
The white cross in (a) marks the expected value of both statistics for
a non-variable reference star. The distribution of data points in (b)
corresponds to 3682 cases where neither $\chi^2$ nor ANOVA detect
variations (lower left corner), 131 cases where $\chi^2$ detects
variations and ANOVA does not (upper left corner), 704 cases where
ANOVA detects variations and $\chi^2$ does not (lower right corner),
and 1483 cases where both tests detect variations (upper right corner).

Actually, the significance level of the $\chi^2$ test when the true
photometric error of the data is unknown, but estimated from one or
more comparison stars, is very easy to calculate: it corresponds to the
F statistics. In our case, each lightcurve comprises 150 observations
($\nu = 149$ degrees of freedom) and the nominal significance level of
the test is $\alpha = 0.1\%$, which corresponds to a critical
$\chi^2 = 208$. We divide this value by $\nu$ to obtain the reduced
$\chi^2$ critical value, 1.40, and calculate the significance level
$\alpha$ for the F statistics with $\nu_1 = 149$ and
$\nu_2 = 149 \times N_*$, for $F^{(\alpha)}_{149,\,149 N_*} = 1.40$.
Some of these calculations are presented in Table 3. Column (1)
indicates the number of stars $N_*$ used to estimate the error; column
(2) is the actual significance level $\alpha$ for the test; and column
(3) is the relative error in the variance estimate.

Note the value for a single comparison star ($N_* = 1$): the actual
significance level is 2.12% rather than 0.1%. Therefore, for 6000
lightcurve simulations this test should produce
$0.0212 \times 6000 \simeq 127$ Type I errors, which agrees with the
result of the simulations for the $\chi^2$ test performed using the
standard deviation of the comparison star (see the second item in
Table 2). Other interesting results reported in Table 3 are that,
combining the data for 10 stars, the actual significance level is still
almost twice the nominal value, and that it is necessary to combine up
to 30 or 40 stars to obtain a significance level accurate to within
20%.

Sample variances are $\chi^2$ distributed random variables (equation
A9), and therefore it can be demonstrated easily that, in our case, the
sample variance for each photometric data point obtained from measuring
$N_*$ stars also has an associated variance (i.e. the variance of the
variance) given by:

$$\mathrm{Var}[s^2] = \frac{2\sigma^4}{N_* - 1}$$

where $\sigma$ is the true photometric error, and $N_* - 1$ the degrees
of freedom. Thus the error of the measured variance is expressed by the
square root of the previous equation:
$\mathrm{Err}[s^2] = \sigma^2 \sqrt{2/(N_* - 1)}$. For observations
analogous to the simulations reported in this paper,
$\sigma \simeq 0.01$ mag. The relative error for the variance estimate
($\mathrm{Err}[s^2]/\sigma^2$) obtained from $N_*$ stars is shown in
the third column of Table 3. Even for $N_* = 40$ the variance estimate
has an accuracy worse than 20%, which implies that the photometric
error estimate is inaccurate by approximately 50%
($s = 0.010 \pm 0.005$). But even in the case that it were possible to
observe tens of stars simultaneously, attempting to meet the controlled
conditions of the simulations where the errors are completely known,
ANOVA's statistical power performs better than the $\chi^2$ test to
find tiny variations in the simulated differential lightcurves of the
quasar. Although the actual differences in power may vary depending on
the test implementation and lightcurve characteristics, or even the
actual observational methodology, the effort of combining the
lightcurves of tens of foreground stars may be irrelevant in most
cases.

Fig. 6. - Distribution of statistics against absolute amplitudes of the
Gaussian peak variations and critical values; (a) Points indicate the
values of the $\chi^2$ statistics with $\alpha_1 = 0.001$, $\nu = 149$,
and the solid line the critical value for the one-sided test
$\chi^2_{\alpha_1,\nu} = 208$; (b) Idem for the ANOVA statistics with
$\alpha_1 = 0.001$, $\nu_1 = 29$, $\nu_2 = 120$ and
$F^{(\alpha_1)}_{(\nu_1,\nu_2)} = 2.28$; (c) Idem for the ANOVA
statistics with 30 min lags, $\alpha_1 = 0.001$, $\nu_1 = 9$,
$\nu_2 = 40$ and $F^{(\alpha_1)}_{(\nu_1,\nu_2)} = 4.02$; (d) Idem for
the C-statistics with $\alpha_2 = 0.01$ and its critical value for the
two-sided normal test $Z_{\alpha_2} = 2.576$.

Fig. 7. - Distribution of statistics against absolute amplitudes of the
linear variations and critical values; (a) Points indicate the values
of the $\chi^2$ statistics with $\alpha_1 = 0.001$, $\nu = 149$, and
the solid line the critical value for the one-sided test
$\chi^2_{\alpha_1,\nu} = 208$; (b) Idem for the ANOVA statistics with
$\alpha_1 = 0.001$, $\nu_1 = 29$, $\nu_2 = 120$ and
$F^{(\alpha_1)}_{(\nu_1,\nu_2)} = 2.28$; (c) Idem for the ANOVA
statistics with 30 min lags, $\alpha_1 = 0.001$, $\nu_1 = 9$,
$\nu_2 = 40$ and $F^{(\alpha_1)}_{(\nu_1,\nu_2)} = 4.02$; (d) Idem for
the C-statistics with $\alpha_2 = 0.01$ and its critical value for the
two-sided normal test $Z_{\alpha_2} = 2.576$.

Fig. 8. - Percentage of detections per amplitude of Gaussian peak
variation. (a) $\chi^2$ statistics; (b) ANOVA; (c) ANOVA with 30 min
lags.

Fig. 9. - Percentage of detections per amplitude of linear variation.
(a) $\chi^2$ statistics; (b) ANOVA; (c) ANOVA with 30 min lags.

Table 3: Effect of the number of comparison stars on the $\chi^2$ test
accuracy.

Number of stars    alpha     Err(s^2)/sigma^2
      1            0.0212    1.4142
      5            0.0029    0.6325
     10            0.0018    0.4472
     20            0.0014    0.3162
     30            0.0012    0.2582
     40            0.0012    0.2236
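As a quick numerical check of the third column of Table 3, the sketch below evaluates the relative error of a variance estimate for normally distributed data, $\sqrt{2/\mathrm{dof}}$. It rests on one assumption that should be stated plainly: the printed values are reproduced when the number of stars $N_*$ itself is used as the number of degrees of freedom, whereas the expression with $N_* - 1$ gives somewhat larger values and is undefined for a single star. The function name is illustrative.

```python
import math


def relative_error_of_variance(dof):
    """Relative standard error of a sample variance, sqrt(2 / dof)."""
    return math.sqrt(2.0 / dof)


n_stars = [1, 5, 10, 20, 30, 40]
# Third column of Table 3, as printed.
tabulated = [1.4142, 0.6325, 0.4472, 0.3162, 0.2582, 0.2236]

# Assumption: degrees of freedom taken equal to the number of stars.
computed = [round(relative_error_of_variance(n), 4) for n in n_stars]

# For comparison, the same expression with N* - 1 degrees of freedom
# (skipping the single-star case, where it diverges).
with_minus_one = {n: relative_error_of_variance(n - 1) for n in n_stars if n > 1}
```

Either way, the relative error falls only as the inverse square root of the number of stars, so doubling the number of comparison stars improves the variance estimate by about 30%.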

Another concern with measuring errors using a large sample of suitable
foreground stars is that it is not always possible. In fact, the number
density of bright stars around quasars that can be used for
differential photometry studies is small. From a photometric point of
view, quasars are blue objects. For the most reliable differential
photometry, at least in the bluest optical bands, quasars should
preferably be compared with either nearby white dwarf stars or bluish
Main Sequence stars to avoid color effects that may arise at different
air-masses. Besides, quasars are observed at high galactic latitudes
and thus, unless the telescope field is large enough, foreground stars
around quasars are usually scarce. Moreover, many of them may be old,
low luminosity stars in the thick galactic disk or in the galactic
halo. Then, many of the brightest foreground stars will be luminous
Pop II red giants and evolved, fast period variable stars. Although the
effect of observing objects with moderate color differences at low
air-masses is negligible in broad band studies, red giant stars should
be avoided in differential optical photometry of quasars. Fast period
variables are of course unsuitable as reference and comparison stars.
Therefore, around most quasars there are only a few useful nearby, not
too red and non-variable stars bright enough to be used for comparison
purposes.

Fig. 10. - An example of the double dependence of the Gaussian peak
detections on the amplitude and the temporal length of the variation.
The smoothed $\chi^2$ statistics of the simulated data is shown as a
contour plot of the probability of detection with the FWHM of the
variations in minutes on the X axis, and the amplitude in magnitudes on
the Y axis. Plots for the One-Way ANOVA statistics would show a similar
aspect.

Another interesting subject is the possibility of detecting very fast
variations. It has been commented above that the statistical power of
the tests is limited by both the duration of the microvariability event
and its amplitude. For the observational parameters considered in these
simulations, the effort to detect microvariations lasting less than
about 30 min and with amplitudes of less than 0.02 mag would be very
inefficient. The relatively large number of spikes reported in the
microvariability literature (Sagar et al. 1996; de Diego et al. 1998;
Gopal-Krishna et al. 2000; Stalin et al. 2004a) suggests that they may
be a common phenomenon, worth investigating using very fast and
accurate photometry with large telescopes.

The One-Way ANOVA with 30 min lags procedure is analogous to the
methodology used by de Diego et al. (1998). Simulations show that this
procedure is also robust, but the smaller number of points in each data
set (50 observations rather than 150) due to the discontinuity in the
sampling affects the number of detections, which is about one half of
that obtained with the One-Way ANOVA continuous sampling. But in the
real world, the final result will depend on the actual timescales of
the microvariations. Stalin et al. (2004a) report three kinds of
variations, namely small gradual variations lasting over several hours,
time resolved microvariability on hour-like timescales, and
single-point fluctuations (spikes). In this case, One-Way ANOVA with
30 min lags will perform almost as well (within 20%) as the continuous
sampling in detecting the first two types of microvariations.
Therefore, as this technique permits monitoring other target fields
during the gaps between the end of a group of observations and the
beginning of the next group, it turns out to be a very powerful
exploratory methodology to detect microvariations. For example, in the
One-Way ANOVA with 30 min lags case each group of 5 observations lasts
10 min (accounting for exposures and read-outs), and there is a gap of
20 min until the first exposure of the next group begins. Depending on
the telescope setup, it would be possible to switch to at least one,
maybe two, nearby target fields, increasing the number of monitored
objects by a factor of two or three.

The results for the ANOVA and $\chi^2$ tests discussed above do not
demonstrate that, under all circumstances, One-Way ANOVA tests perform
better than a well implemented $\chi^2$ test. It is possible to sustain
that results similar to those presented in this paper would have been
obtained in the case, for example, of lightcurve shapes not considered
here, such as saw-tooth variations. However, there are many
possibilities of test implementation and observational techniques. One
of the requirements of ANOVA tests is the homogeneity of variances,
that is, errors should be distributed equally in all the groups. ANOVA
is robust against moderate violations of this requirement, and thus
this is not a big concern when comparing data obtained with the same
telescope and equipment, and during the same night under ordinary
atmospheric conditions. However, combining data from a couple of
telescopes with different characteristics may ruin the One-Way ANOVA's
performance.

4.2. Remarks on the C-statistics results 

Tests of the reliability of the C-statistics methodology have confirmed
the severity of the problems pointed out in § A.4. This methodology has
been widely accepted as a standard test, and thus the confusion
generated in the research of microvariability in quasars is not
surprising. Several researchers have used C-statistics to study mixed
samples of radio quiet and radio loud quasars together with blazars.
For example, Romero et al. (1999) include several BL Lac objects,
accounting for nearly 70% of their microvariability detections. In most
cases, only blazar-like objects will show variations extreme enough to
lie above the proposed C-statistics critical value. A detailed
discussion about the relevance of a careful sample selection can be
found in Ramírez et al. (2009).

In the case of high accuracy photometry, such as the observations
presented in Romero et al. (1999) ($\sigma \simeq 0.001$ mag), rare
microvariability events might also be detected by the C-statistics in
non-blazar objects. For example, Gupta & Joshi (2005) detect
microvariation events if the errors are about 0.005 mag, but usually
fail to detect them if the errors are 0.01 mag or above. Note that this
result is in accordance with those presented in § 4.1, in the sense
that for the accuracy of 0.01 mag considered in the simulations,
C-statistics is always a factor > 2 below its critical value.

There is contradictory evidence that some RQQ may have a weak blazar
component (Czerny et al. 2008; Chand et al. 2009), which might explain
at least some reported microvariability detections. If this blazar
component is present, it might account for flux fluctuations in RQQs
above 0.05 mag, and for the extreme variations above 0.1 mag reported
for PG 0026+129 (Jang 2005) and US 995 (de Diego et al. 1998). Note,
however, that large amplitude variations are not considered in the
simulations presented in this paper.

When a more appropriate methodology is used, the results converge in
the sense that microvariability in radio quiet quasars is also a common
phenomenon. Thus, Gupta & Joshi (2005) used the F test and found that
2, probably 3, radio quiet quasars out of a small sample of 6 present
microvariations (i.e. between 30 and 50%), but when they mix their
results with those obtained by other groups that use C-statistics, the
total numbers are 6, possibly 14, objects that vary out of 49 (i.e.
between 12 and 29%). On the other hand, Stalin et al. (2005) separate
BL Lacs in their sample and, using a combination of the C-statistics
and the structure function, find that the Duty Cycles for radio loud
and radio quiet quasars are not statistically different (18 and 22%
respectively). This result is also analogous to the value calculated by
Stalin et al. (2004a) for the Duty Cycle inferred from de Diego et al.
(1998) (~25%).



5. Conclusions 

In this paper, several statistical methodologies and their variants for
microvariability studies in quasars have been tested. Among these
techniques, the One-Way ANOVA and $\chi^2$ tests have proven to be the
most robust statistics. However, some 'reasonable' $\chi^2$ test
variants, particularly those that estimate errors from the lightcurves
of comparison stars rather than rely on accurate physical models to
characterize the data quality, are much less robust. These $\chi^2$
variants should be avoided, as they can be substituted easily by other
robust tests, or at least they should be studied in detail, using
simulations to understand their limitations, before they are used in
statistical analysis. The error estimation problem does not affect the
One-Way ANOVA statistics, which reaches the highest performance in
detecting variations in the simulated differential lightcurves of
quasars presented in this paper. In fact, the results presented in this
paper for the One-Way ANOVA test can be further improved by trying
different group sizes (see § B). On the other hand, a discontinuous
sampling based on the One-Way ANOVA methodology proves to be a powerful
exploratory technique to detect microvariations. The relative loss of
power of the discontinuous monitoring depends on the sampling
frequency, but it is possible to trade off this sampling frequency
against an increase in the number of monitored objects by switching
target fields during the gaps between the groups of observations,
resulting in a larger total number of detections of variability.

The F test for variances has less power than the One-Way ANOVA and
$\chi^2$ tests, but it is still a valid option to detect flux
variations in AGNs. However, the C-statistics contains several
misconceptions and cannot be considered a true statistical test.

A. Tests description

A.1. F test for sampled variances

Given two sample variances such as $s_Q^2$ for quasar differential
lightcurve measurements and $s_*^2$ for the comparison star, the number
of degrees of freedom for each sample, $\nu_Q$ and $\nu_*$
respectively, will usually be the same and equal to the number of
measurements less one ($\nu = N - 1$). Then, if there are no
differences in population variances, the sample variance ratio is
distributed as the F distribution with $\nu_Q$ and $\nu_*$ degrees of
freedom:

$$F = \frac{s_Q^2}{s_*^2} \tag{A1}$$

This F statistics is compared with the $F^{(\alpha)}_{\nu_Q,\nu_*}$
critical value, where $\alpha$ is the significance level set for the
test, and $\nu_Q$ and $\nu_*$ the degrees of freedom of the quasar and
the comparison star samples. The smaller the value of $\alpha$, the
less probable it is that the result is produced by chance. Thus a value
of $\alpha = 0.1\%$ or 1%, as assumed in this paper, roughly
corresponds to a $3\sigma$ or a $2.6\sigma$ detection, respectively. If
F is larger than the critical value, the null hypothesis (no
variability) is discarded. In this paper, the F test has been performed
at two significance levels (0.1% and 1%) to allow comparison with other
statistical procedures. The respective critical values are
$F^{(0.001)}_{149,149} = 1.6651$ and $F^{(0.01)}_{149,149} = 1.4666$.

A well known problem that may affect the outcomes of the F test for
variances is that it is non-robust to non-normality (Lehmann 1986,
§5.4).


A.2. $\chi^2$ test for variance in a normal population

Given a number of observations of a source over a given period of time,
these observations are supposed to be taken from a population of
possible observations having a normal distribution. For this sample,
the mean magnitude is $\bar{V}$, the $i$th observation yields a
magnitude $V_i$, and the corresponding standard error is $\sigma_i$.
Then, the $\chi^2$ statistics is expressed by:

$$\chi^2 = \sum_{i=1}^{N} \frac{(V_i - \bar{V})^2}{\sigma_i^2} \tag{A2}$$

This statistics is compared against a critical value
$\chi^2_{\alpha,\nu}$ obtained from the $\chi^2$ probability function,
where $\alpha$ is the significance level and $\nu = N - 1$ are the
degrees of freedom. If $\chi^2 > \chi^2_{\alpha,\nu}$ the test
indicates a larger than expected scattering of the data points (i.e.
evidence of variability). Each simulation comprises $N = 150$
observations, and thus $\nu = 149$. Therefore, the critical value
adopted for this test is $\chi^2_{0.001,149} = 208$.
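The test described above can be sketched in a few lines. This is a minimal illustration, assuming that the true error of every observation is known, as in the simulations; the function names are hypothetical, and the critical value is simply the one quoted in the text for $\alpha = 0.001$ and $\nu = 149$ rather than being computed from the $\chi^2$ distribution.

```python
def chi2_statistic(mags, errors):
    """Chi-square statistic of equation (A2) using the unweighted mean.

    mags   -- observed magnitudes V_i
    errors -- true standard errors sigma_i (assumed known, not estimated)
    """
    mean = sum(mags) / len(mags)
    return sum(((v - mean) / s) ** 2 for v, s in zip(mags, errors))


# Critical value quoted in the text for alpha = 0.001 and nu = 149.
CHI2_CRITICAL_149 = 208.0


def detects_variability(mags, errors, critical=CHI2_CRITICAL_149):
    """True if the scatter exceeds what the known errors allow."""
    return chi2_statistic(mags, errors) > critical


# Toy example with three points and unit errors: deviations -1, 0, +1.
example = chi2_statistic([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
```

The critical value depends on the number of observations, so 208 is only appropriate for lightcurves with 150 points.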



Pica & Smith (1983), Heidt & Wagner (1996), Gopal-Krishna et al.
(2000), Andruchow et al. (2005) (see references therein) and others
follow the $\chi^2$ test proposed by Penston & Cannon (1970) and
Kesteven et al. (1976). This procedure uses a 'weighted' average
defined by:

$$\bar{V} = \frac{\sum_{i} V_i/\sigma_i^2}{\sum_{i} 1/\sigma_i^2} \tag{A3}$$

Note that $\sigma_i$ is the expected error, i.e. the error from
considering photon noise from the source and sky, the CCD read-out and
all possible non-systematic sources of error, some of them probably
unknown in practice. As these individual errors are unknown, different
estimates $s_i$ are used instead of $\sigma_i$. Thus, errors are often
calculated from the usually underestimated value yielded by the IRAF
reduction package, multiplied by a correction factor. This error
rescaling is necessary because $\chi^2$ and other tests (but not ANOVA)
assume that the distribution of the real errors is known.

For example, Bachev et al. (2005) use a factor of 1.3; Stalin et al.
(2004a, 2005), Gopal-Krishna et al. (2003) and Gupta et al. (2008, in
the near infrared) use 1.5; Garcia et al. (1999) 1.73; and
Gopal-Krishna et al. (1995) 1.75. In fact, standard IRAF photometric
packages do not take into account appropriate propagation of errors
during the image processing. In particular, de Diego et al. (1998) have
argued that these IRAF packages do not consider the possible spurious
enhancement of the S/N ratio after the flat-field correction, due to
changes in the sensitivity across the detector (for example, the
sensitivity near the borders may be lower than at the center), and that
the error estimated directly from the corrected signal may be different
from the original error. Note that this affects only the error
estimation and not the measured level of the flux. Up to now, there is
no astronomical reduction software that adequately deals with this
problem, and implementing a solution may not be a trivial issue. ANOVA
(see § A.3) overcomes this problem by empirically measuring the
dispersion within each group of observations (directly associated with
the true S/N ratio), while data means for each group are conserved by
flat-fielding the images.

Usually, images of the object and the comparison stars (one or more)
are recorded in the same CCD frames. However, combining several
comparison stars to estimate the individual errors $\sigma_i$ in
equation (A2) does not solve the problem. Here it is worth noting that
$\chi^2/\nu$ with $\nu$ degrees of freedom is distributed as
$F_{\nu,\infty}$. Thus, even if $\sigma_i$ is obtained using several
comparison stars to get a better estimate of the errors, the precision
is still limited by the number of these stars, which is probably less
than a dozen. In contrast, the $\chi^2$ test requires that each
$\sigma_i$ be calculated from all the possible measurements of an
infinite population of stars. This issue is exemplified in § 4.1.
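The effect of a finite number of comparison stars can be evaluated numerically. The sketch below computes the tail probability of the F distribution through the regularized incomplete beta function, using only the Python standard library, and then the actual significance level when the reduced $\chi^2$ critical value 208/149 is referred to an F distribution with 149 and $149 \times N_*$ degrees of freedom, as described in § 4.1. The function names and the continued-fraction implementation are choices made for this illustration, not the code used for the paper.

```python
import math


def _guard(value, tiny=1e-300):
    """Keep denominators away from zero inside the continued fraction."""
    return tiny if abs(value) < tiny else value


def _beta_cf(a, b, x, max_iter=1000, eps=1e-14):
    """Continued fraction of the incomplete beta function (modified Lentz)."""
    c = 1.0
    d = 1.0 / _guard(1.0 - (a + b) * x / (a + 1.0))
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even-numbered coefficient.
        term = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        d = 1.0 / _guard(1.0 + term * d)
        c = _guard(1.0 + term / c)
        h *= d * c
        # Odd-numbered coefficient.
        term = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        d = 1.0 / _guard(1.0 + term * d)
        c = _guard(1.0 + term / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b) for 0 <= x <= 1."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def f_tail_probability(f_value, dof1, dof2):
    """P(F > f_value) for an F distribution with (dof1, dof2)."""
    x = dof2 / (dof2 + dof1 * f_value)
    return regularized_incomplete_beta(0.5 * dof2, 0.5 * dof1, x)


def actual_alpha(n_stars, n_obs=150, chi2_critical=208.0):
    """Actual significance level when errors come from n_stars stars."""
    dof = n_obs - 1
    return f_tail_probability(chi2_critical / dof, dof, dof * n_stars)


alphas = {n: actual_alpha(n) for n in (1, 5, 10, 20, 30, 40)}
```

The values in `alphas` can be compared with the second column of Table 3; they approach the nominal 0.1% only slowly as the number of stars grows.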

Some authors substitute the individual errors by a common error
$\sigma$ estimated from the dispersion in the comparison star data.
This is the standard procedure used when performing the C-statistics
(see § A.4). However, in differential photometry the number of images
of the object and of the comparison stars is the same, and therefore
the estimates for the standard deviation of their lightcurves have the
same number of degrees of freedom. Then, the $\chi^2$ test is biased
because it takes into account only the degrees of freedom of the
estimation for the quasar. Therefore, the correct procedure is to
consider the F test for two sample variances as proposed by Howell et
al. (1988). In conclusion, any procedure that cannot rely on
theoretically known (not estimated) error values compromises the
reliability of the $\chi^2$ test. Whenever the true errors are unknown,
other statistical methodologies should be used.

In the case of the simulations presented in this paper, all the
parameters are controlled and the true errors can be effectively
computed. Equation (A2) has been used to compute the $\chi^2$
statistics. Note that data quality is ensured by the simulation design,
thus there is no need to use weighted averages.



A.3. One-way ANOVA test 

ANOVA tests are used to compare the means of a number of samples. Due
to the Central Limit Theorem, no matter what the shape of the original
distribution is, the sampling distribution of the mean approaches a
normal distribution. Therefore, tests to compare the means (t-test and
ANOVA) are more robust than their counterparts to compare variances
($\chi^2$ and F tests). This offsets the problem of the non-robustness
of the F test.

One-Way ANOVA has been applied by de Diego et al. (1998) to investigate
the variability in the lightcurves of quasars. The methodology
consisted in measuring $k$ groups of $n_j = 5$ consecutive short
(1 min) observations. The $k$ groups are ideally separated by
20-30 min. Larger time lags are common, affecting the time resolution
but not the statistical significance of the test.

For the mathematical description of the One-Way ANOVA test, if $y_{ij}$
represents the $i$th observation (with $i = 1, 2, \ldots, n_j$) in the
$j$th group (with $j = 1, 2, \ldots, k$), the linear model describing
every observation is:

$$y_{ij} = \bar{y} + g_j + \epsilon_{ij} \tag{A4}$$

where $\bar{y}$ represents the mean of the whole data set,
$g_j = \bar{y}_j - \bar{y}$ the between-groups deviation, and
$\epsilon_{ij} = y_{ij} - \bar{y}_j$ the within-groups deviation, also
called residual or measurement error. The size of the data set will be
$N = \sum_{j=1}^{k} n_j$. If the number of observations in the groups
$n_j$ is constant, $N = k \times n_j$.

As commented previously, ANOVA tests whether the means of the groups are equal. The condition tested or null 
hypothesis is that the means of the different groups are equal. If the test yields a probability smaller than the adopted 
significance level a, the null hypothesis will be rejected and the alternate hypothesis (at least one group mean is 
different from the others), will be accepted. The alternate hypothesis in this case implies detection of variability in the 
quasar lightcurve. 

From equation (A4), the total sample variation can be separated into
variations between and within groups:

$$\sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y})^2 = \sum_{j=1}^{k} n_j (\bar{y}_j - \bar{y})^2 + \sum_{j=1}^{k}\sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 \tag{A5}$$

where the term on the left side describes the total deviations of the
data with respect to the mean. The first term on the right side of the
equation represents the total variation between groups, and the last
term the total errors. Equation (A5) can be shortened to:

$$SS_T = SS_G + SS_R \tag{A6}$$

where $SS_T$ stands for the total sum of squares. Similarly, $SS_G$ and
$SS_R$ stand for group sum of squares and residual sum of squares.

Whenever the null hypothesis is true, the $k$ groups of sampled data
will be normally and independently distributed, with mean $\mu$ and
variance $\sigma^2$. Then, the statistics:

$$F = \frac{SS_G/(k-1)}{SS_R/(N-k)} = \frac{MS_G}{MS_R} \tag{A7}$$

corresponds to the F distribution with $\nu_1 = k - 1$ and
$\nu_2 = N - k$ degrees of freedom. The pseudo variances $MS_G$ and
$MS_R$ are mean estimates for the variations between groups and
residuals, respectively. For a certain significance level $\alpha$, if
F exceeds the critical value $F^{(\alpha)}_{\nu_1,\nu_2}$, the null
hypothesis will be rejected.

The F critical values employed in this paper for the ANOVA tests are
$F^{(0.001)}_{29,120} = 2.2819$ and $F^{(0.001)}_{9,40} = 4.0243$ for
the full and the 30 min lags sample simulations, respectively.
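The sums of squares and the F statistics described above can be computed directly from the grouped observations. The following is a minimal sketch using plain Python lists; the function names are illustrative, and the grouping helper assumes consecutive observations are assigned to groups of five, as in the continuous-sampling simulations.

```python
def group_lightcurve(mags, group_size=5):
    """Split a lightcurve into consecutive groups of group_size points."""
    return [mags[i:i + group_size] for i in range(0, len(mags), group_size)]


def one_way_anova_f(groups):
    """Return (F, nu1, nu2) for a list of groups of observations."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares, SS_G.
    ss_g = sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(groups, group_means))
    # Residual (within-groups) sum of squares, SS_R.
    ss_r = sum((y - m) ** 2
               for g, m in zip(groups, group_means) for y in g)
    nu1, nu2 = k - 1, n_total - k
    f_stat = (ss_g / nu1) / (ss_r / nu2)
    return f_stat, nu1, nu2


# Toy example: three groups with means 2, 3 and 6.
f_stat, nu1, nu2 = one_way_anova_f([[1.0, 2.0, 3.0],
                                    [2.0, 3.0, 4.0],
                                    [5.0, 6.0, 7.0]])
```

For a lightcurve of 150 points grouped in fives, the degrees of freedom returned are (29, 120), so the resulting F would be compared with the critical value 2.2819 quoted above.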
A.4. C-statistics 

C-statistics was first employed by Jang & Miller (1997) and generalized
by Romero et al. (1999). The statistical parameter used is

$$C = \frac{\sigma_T}{\sigma} \tag{A8}$$

where $\sigma_T$ and $\sigma$ are the standard deviations of the quasar
and the comparison star differential lightcurves. The adopted
variability criterion requires that $C \geq 2.576$, which corresponds
to a 99% confidence level, or 1% significance level following the
notation in this paper.

There are two pitfalls with this criterion. First, the critical value
2.576 corresponds to the 1% significance level of the normal
distribution for a two-sided test, rather than a one-sided comparison.
The two-sided test would be relevant to test whether the dispersion in
the quasar lightcurve may be either larger or smaller than the
dispersion of the comparison star lightcurve. Note also that
C-statistics would always be positive, and that its expected value when
$\sigma_T = \sigma$ is centered around 1, rather than 0 as would be
expected in the case of a fair normal distribution.

Another pitfall, probably the most important, is that two standard deviations cannot be compared using the normal 
distribution. In fact, standard deviations are unsuitable for most calculations because they are not linear 
statistical operators (for example, given two independent random variables $A$ and $B$ with standard deviations $\sigma_A$ and 
$\sigma_B$, respectively, the standard deviation of the sum $A + B$ is not $\sigma_A + \sigma_B$). This is why variances have to be used 
instead; variances are the second central moments of the statistical distributions and therefore behave as linear operators (in 
the previous example, the variance of the sum $A + B$ is $\sigma_A^2 + \sigma_B^2$). 
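
The additivity of variances, and the lack of it for standard deviations, can be verified numerically. The sketch below assumes two independent Gaussian variables with hypothetical standard deviations of 3 and 4, chosen only because their quadrature sum is easy to read.

```python
import math
import random
from statistics import pvariance

rng = random.Random(42)
n = 50_000
sigma_a, sigma_b = 3.0, 4.0  # hypothetical standard deviations

a = [rng.gauss(0.0, sigma_a) for _ in range(n)]
b = [rng.gauss(0.0, sigma_b) for _ in range(n)]
total = [x + y for x, y in zip(a, b)]  # the sum A + B, drawn independently

var_sum = pvariance(total)
std_sum = math.sqrt(var_sum)

print(f"variance of A+B  = {var_sum:.2f} "
      f"(sigma_A^2 + sigma_B^2 = {sigma_a**2 + sigma_b**2:.2f})")
print(f"std. dev. of A+B = {std_sum:.2f} "
      f"(sigma_A + sigma_B = {sigma_a + sigma_b:.2f})")
```

The measured variance should land close to 25, whereas the measured standard deviation should land close to 5, well below the naive sum of 7.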

If we draw all possible samples of a given size $N$ from a normally distributed population and compute the variances 
of all those samples, we will obtain a distribution of sample variances that starts at $s^2 = 0$ and has a mean of 
$\sigma^2$. Thus, even if the distribution of all possible sample means drawn from a normally distributed population will 
be approximately symmetrical (as a consequence of the Central Limit Theorem), an equivalent distribution of sample 
variances will not approximate symmetry, but will be distributed as $\chi^2$ with $N - 1$ degrees of freedom (this result is a 
consequence of Cochran's Theorem for the sum of squares of linear combinations of a set of independent standard 
normally distributed random variables): 



$$\chi^2 = (N - 1)\,\frac{s^2}{\sigma^2} \qquad \mathrm{(A9)}$$



One important outcome of this equation is that the shape of the distribution will be different for different 
sample sizes. Thus, the dispersion of the distribution of sample variances will depend on how many degrees of freedom 
our estimate has. Note that this dependence of the variance dispersion on the number of degrees of freedom carries over 
to the standard deviation, although the $\chi^2$ statistic does not apply in this case. 
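
The dependence of the dispersion of sample variances on the sample size can be illustrated with a small Monte Carlo experiment. This is a sketch only: the sample sizes, the number of trials and the helper name `sample_variances` are assumptions for the example. It uses the known moments of the $\chi^2$ distribution with $\nu$ degrees of freedom (mean $\nu$, variance $2\nu$), so that $s^2$ has mean $\sigma^2$ and standard deviation $\sigma^2\sqrt{2/\nu}$.

```python
import random
from statistics import fmean, pstdev, variance

def sample_variances(sample_size, trials, rng, sigma=1.0):
    """Unbiased sample variances s^2 of `trials` independent Gaussian samples."""
    return [
        variance([rng.gauss(0.0, sigma) for _ in range(sample_size)])
        for _ in range(trials)
    ]

rng = random.Random(7)
for sample_size in (5, 20, 50):
    s2 = sample_variances(sample_size, trials=2000, rng=rng)
    nu = sample_size - 1
    # With sigma = 1, the spread of s^2 predicted by chi^2 is sqrt(2 / nu).
    expected_spread = (2.0 / nu) ** 0.5
    print(f"N = {sample_size:2d}: mean(s^2) = {fmean(s2):.3f}, "
          f"std(s^2) = {pstdev(s2):.3f} (chi^2 prediction {expected_spread:.3f})")
```

The mean of $s^2$ stays near $\sigma^2$ for every sample size, while its spread shrinks as the number of degrees of freedom grows.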

Using the F statistic it is easy to calculate the 'real' significance of the C-statistics. If the critical value from 
equation (A8) is 2.576, the critical F value is $2.576^2 = 6.636$. For F with (20,20) and (50,50) degrees of freedom, 
the significance level of the test will be $4.4 \times 10^{-5}$ and $2.1 \times 10^{-10}$, respectively; i.e. much less than the $10^{-2}$ value 
considered by Jang & Miller (1997) and Romero et al. (1999). Note that Romero et al. report that variations in radio 
loud quasars occur more often than in radio quiet quasars, but most sources in their radio loud sample are strongly 
variable BL Lac objects that can easily reach high significance levels of microvariability detection. 
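
The significance levels implied by the squared C criterion can be recomputed without external libraries. The sketch below is restricted to even degrees of freedom, for which the upper tail of the F distribution (a regularized incomplete beta function) reduces to a finite binomial sum. The function name `f_upper_tail` is hypothetical.

```python
from math import comb

def f_upper_tail(f_value, nu1, nu2):
    """P(F > f_value) for an F distribution with even nu1 and nu2.

    P(F > f) = I_x(nu2/2, nu1/2) with x = nu2 / (nu2 + nu1 * f); for integer
    arguments a, b the incomplete beta I_x(a, b) equals the probability that
    a Binomial(a + b - 1, x) variable is at least a.
    """
    if nu1 % 2 or nu2 % 2:
        raise ValueError("this sketch only handles even degrees of freedom")
    a, b = nu2 // 2, nu1 // 2
    x = nu2 / (nu2 + nu1 * f_value)
    m = a + b - 1
    return sum(comb(m, j) * x**j * (1.0 - x) ** (m - j)
               for j in range(a, m + 1))

c_crit = 2.576
f_crit = c_crit**2  # a ratio of standard deviations squared is an F ratio
for nu in (20, 50):
    alpha = f_upper_tail(f_crit, nu, nu)
    print(f"nu = ({nu},{nu}): F_crit = {f_crit:.3f}, alpha = {alpha:.1e}")
```

The resulting tail probabilities should come out close to the values quoted in the text, several orders of magnitude below the nominal 1% level.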

To summarize this discussion, the critical values for the C-statistics criterion are wrongly established, and 
equation (A8) does not describe a normally distributed variable because it is neither properly centered so that its 
expected mean value is zero, nor does the parameter $\sigma$, which does not depend on the sample size, represent an unbiased measurement of the 
dispersion of the sample standard deviation. The squared value of the C-statistic can be used to perform an unbiased 
F test if the degrees of freedom used for the $\sigma_T$ and $\sigma$ estimates are provided. 



B. Improving the ANOVA power 

The power of any statistical test to study variability during a given time interval depends on the number of observations 
recorded, which sets the time resolution, and on the measurement precision, which sets the amplitude resolution. In 
practice, a compromise is reached to hold reasonable resolutions for both factors before actually observing the target. 
Would it not be useful to have the possibility of swapping between the time and amplitude resolutions after the observations? 
Binning data might help, but it is shown earlier in this paper that it does not work for the $\chi^2$ methodology because of the loss of 
degrees of freedom. 

In the ANOVA methodology, if the target has been monitored discontinuously, as in the case of ANOVA with 30 min 
lags, the binning is fixed and consequently the time and amplitude resolutions cannot be exchanged either. But if the 
monitoring was continuous, it is possible to try out different bin sizes to improve the amplitude resolution at the expense 
of the time resolution. In this paper, a binning of 5 observations was used because it was a priori a reasonable value 
for time resolution and allowed direct comparison with the ANOVA with 30 min lags strategy. Now, different bins of 
$n$ observations will be considered to achieve the maximum power of the ANOVA test for the lightcurve simulations 
considered in this paper. The bin size $n$ will always be a divisor of the total number $N = 150$ of observations of a 
lightcurve simulation. 
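
The admissible bin sizes and the degrees of freedom they imply follow directly from $N = 150$. A minimal sketch, assuming that the trivial designs with a single observation per group or a single group are excluded (the function name `bin_designs` is hypothetical):

```python
N_OBS = 150  # observations per simulated lightcurve, as stated in the text

def bin_designs(n_obs):
    """Yield (n, k, nu1, nu2) for every bin size n with 1 < n < n_obs dividing n_obs."""
    for n in range(2, n_obs):
        if n_obs % n == 0:
            k = n_obs // n              # number of groups
            yield n, k, k - 1, n_obs - k

for n, k, nu1, nu2 in bin_designs(N_OBS):
    print(f"n = {n:2d}  k = {k:2d}  nu1 = {nu1:2d}  nu2 = {nu2:3d}")
```

This enumeration gives ten designs, from $n = 2$ (74 and 75 degrees of freedom) to $n = 75$ (1 and 148 degrees of freedom).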

Table 4 shows the ANOVA results for the Gaussian peak variations considering different bin size strategies. Column 
(1) indicates the bin size $n$; column (2) shows the number of degrees of freedom $\nu_1$ for the $k$ groups ($\nu_1 = k - 1$); 
column (3) displays the number of degrees of freedom $\nu_2$ for the residuals ($\nu_2 = N - k$); column (4) shows the F critical 
value for a significance level $\alpha = 0.1\%$ and $\nu_1$ and $\nu_2$ degrees of freedom, $F^{(0.001)}_{\nu_1,\nu_2}$; column (5) indicates the number 
of Type I errors; column (6) shows the number of detections of variability; and column (7) displays the relative power 
for the ANOVA tests normalized to the results for $n = 15$. 

A careful inspection of Table 4 shows that, for the simulations presented in this paper, the test power to detect 
variations increases with the bin size until reaching a maximum of 2478 detections for $n = 15$ (noted in boldface in 
Table 4), and then decreases for larger bin sizes. The smooth overall detection tendency shows that the differences in 
power for these tests with different bin sizes are not an artifact. For $n = 5$ the relative power is almost 0.9, and all the 
bins between $n = 5$ and 50 have relative power larger than 0.8. These results show that the ANOVA bin size is not a 
fine tuning parameter, and that binning the data in groups of 5 observations is, also from the statistical point of view, a 
reasonable choice. 

Table 4: ANOVA performance for different bin sizes (Gaussian peak variation). 

| $n$ | $\nu_1$ | $\nu_2$ | $F^{(0.001)}_{\nu_1,\nu_2}$ | Type I | Detections | Relative Power |
|----:|--------:|--------:|----------------------------:|-------:|-----------:|---------------:|
|   2 |      74 |      75 |                      2.0657 |      2 |       1178 |           0.48 |
|   3 |      49 |     100 |                      2.0835 |      5 |       1713 |           0.69 |
|   5 |      29 |     120 |                      2.2819 |      3 |       2187 |           0.88 |
|   6 |      24 |     125 |                      2.3907 |      6 |       2268 |           0.92 |
|  10 |      14 |     135 |                      2.8199 |      4 |       2444 |           0.99 |
|  15 |       9 |     140 |                      3.3374 |      1 |   **2478** |           1.00 |
|  25 |       5 |     144 |                      4.3617 |      3 |       2285 |           0.92 |
|  30 |       4 |     145 |                      4.8882 |      7 |       2143 |           0.86 |
|  50 |       2 |     147 |                      7.2428 |      5 |       1996 |           0.81 |
|  75 |       1 |     148 |                     11.2727 |      4 |        430 |           0.17 |

The smallest bin sizes improve the time resolution, but the detections are biased towards large amplitude variations. 
On the contrary, the largest bin sizes can detect small variations, but the time resolution is degraded and fast variations 
are not detected. Thus, the tests for $n = 5$ detect 65 variations with time FWHM up to 13 min, while the tests for 
$n = 15$ detect only 37. But the percentages of detections for an amplitude of variation of approximately 0.01 mag are 
around 10% and 20% for the $n = 5$ and $n = 15$ tests, respectively. 

In the case of linear variations, or more generally monotonic variations, the results are simpler. As the duration of 
the variability event is not an issue in this case, the only factor that matters is the amplitude resolution. Therefore, 
the tests that have more power are those with only two groups, one for each half of the lightcurve, which corresponds to 
$n = 75$, $\nu_1 = 1$ and $\nu_2 = 148$. The actual number of detections for this test is 4080, instead of the 3043 detections that 
were obtained for $n = 5$. 

These results are easily translated to the case of analyzing a single quasar lightcurve. The researcher does not know 
in advance the specific characteristics of a possible microvariability event. But by planning the observations to be analyzed 
using an ANOVA experimental design, the astronomer can enhance the temporal resolution of the observations to 
detect very fast large amplitude variations, and still conserve the amplitude resolution to detect longer timescale low 
amplitude variations. Of course, this also has a cost. The final significance level $\alpha_f$ when performing a set of $n_t$ 
statistical tests is different from the significance level $\alpha_i$ for the individual tests. There are a number of adjustments 
applied in the statistical literature for multiple tests; the most common is the Bonferroni correction: 

$$\alpha_i = \alpha_f / n_t$$

If the set of tests performed is the same as in the case presented above for the simulations, $n_t = 10$ and $\alpha_i = 0.1\%$, 
the final significance level will be $\alpha_f = 1\%$. Similarly, the significance level of the individual tests could be set to 
$\alpha_i = 0.01\%$ to reach a final value of $\alpha_f = 0.1\%$. However, the results shown in Table 4 suggest that performing tests 
for bins of size 5, 15 and 30 may be enough to ensure detection on a wide range of timescales and amplitudes. If only 
three tests are performed, the significance levels are $\alpha_f = 0.3\%$ (if $\alpha_i$ is set at 0.1%) or $\alpha_i = 0.03\%$ (if $\alpha_f$ is maintained 
at 0.1%). 
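
The Bonferroni bookkeeping for the two scenarios discussed above can be written as a pair of one-line helpers. This is a minimal sketch; the function names are hypothetical, and the correction is treated here simply as the product or quotient given in the text.

```python
def bonferroni_final(alpha_individual, n_tests):
    """Final significance level when each of n_tests uses alpha_individual."""
    return alpha_individual * n_tests

def bonferroni_individual(alpha_final, n_tests):
    """Per-test significance level that keeps the final level at alpha_final."""
    return alpha_final / n_tests

for n_tests in (10, 3):
    final = bonferroni_final(0.001, n_tests)
    individual = bonferroni_individual(0.001, n_tests)
    print(f"{n_tests:2d} tests: alpha_i = 0.1% gives alpha_f = {100 * final:.2f}%; "
          f"alpha_f = 0.1% needs alpha_i = {100 * individual:.3f}%")
```

Restricting the analysis to three bin sizes keeps the per-test level roughly three times looser than with the full set of ten tests, for the same final significance.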